Abstract

The amount of personal information unwillingly exposed by users on online social networks 
is staggering, as shown in recent research. Moreover, recent reports indicate that these networks 
are infested with tens of millions of fake user profiles, which may jeopardize the users' security
and privacy. To identify fake users in such networks and to improve users' security and privacy,
we developed the Social Privacy Protector software for Facebook. This software contains three 
protection layers, which improve user privacy by implementing different methods. The software 
first identifies a user's friends who might pose a threat and then restricts this "friend's" exposure 
to the user's personal information. The second layer is an expansion of Facebook's basic privacy 
settings based on different types of social network usage profiles. The third layer alerts users about 
the number of installed applications on their Facebook profile, which have access to their private 
information. An initial version of the Social Privacy Protector software received wide media coverage,
and more than 3,000 users from more than twenty countries have installed the software, out of
which 527 used the software to restrict more than nine thousand friends. In addition, we estimate 
that more than a hundred users accepted the software's recommendations and removed at least 
1,792 Facebook applications from their profiles. By analyzing the unique dataset obtained by the 
software in combination with machine learning techniques, we developed classifiers that are able
to predict which Facebook profiles have a high probability of being fake and therefore threaten
the user's well-being. Moreover, in this study, we present statistics on users' privacy settings and
statistics of the number of applications installed on Facebook profiles. Both statistics are obtained 
by the Social Privacy Protector software. These statistics alarmingly demonstrate how exposed 
Facebook users' information is to both fake profile attacks and third-party Facebook applications.

Keywords. Social Network Security and Privacy, Fake Profiles, Online Social Networks, Facebook, 
Supervised Learning, Facebook Application, Facebook Friends Statistics, Facebook Applications 
Statistics, Facebook Users Privacy Settings. 



1 Introduction 

In recent years, online social networks have grown rapidly and today offer individuals endless possibilities
for publicly expressing themselves, communicating with friends, and sharing information with
people across the world. A recent survey [28] estimated that 65% of adult internet users use online
social network sites, such as Twitter [40], LinkedIn [26], Google+ [18], and Facebook [9]. As of October
2012, the Facebook social network has more than one billion monthly active users [13]. On average,
Facebook users have 138 friends, and users have uploaded more than 219 billion pictures onto Facebook [13].
Moreover, according to the Nielsen "Social Media Report" [31], American Internet users spent more than
53.5 billion minutes on Facebook in the month of May 2011, making Facebook the leading web-brand
in the United States.

Due to the friendly nature of Facebook, users tend to disclose many personal details about themselves
and about their connections. These details can include date of birth, personal pictures, work
place, email address, high school name, relationship status, and even phone numbers. Moreover,
Boshmaf et al. [4] discovered that an average of 80% of studied Facebook users accepted friend requests
from people they did not know if they shared more than 11 mutual friends. In many cases,
accepting friend requests from strangers may result in the exposure of a user's personal information
to third parties. Additionally, personal information of Facebook users can be exposed to third-party
Facebook applications [8]. Another privacy concern deals with existing privacy settings, which, for the
majority of Facebook users, do not match their security expectations [27]. These results indicate that
many users accidentally or unknowingly publish private information, leaving them more exposed than
they assumed.

If a user's personal information is disclosed to a malicious third party, it can be used to threaten
the well-being of the user both online and in the real world. For example, a malicious user can use
the gained personal information to send customized spam messages to users in an attempt to lure
them onto malicious websites [39], or to blackmail them into transferring money to the attacker's
account [30]. To cover their tracks, social network attackers may use fake profiles. In fact, the number
of fake profiles on Facebook can be counted in the tens of millions. According to a recent report [12],
Facebook estimates that 8.7% (83.09 million) of its accounts do not belong to real profiles. Moreover,
Facebook also estimates that 1.5% (14.32 million) of its accounts are "undesirable accounts", which
belong to users who may deliberately spread undesirable content, such as spam messages and malicious
links, and threaten the security and privacy of other Facebook users.

In this study, we present the Social Privacy Protector software for protecting user privacy on 
Facebook (otherwise referred to as SPP). The SPP software consists of two main parts, namely, a 
Firefox add-on and a Facebook application. The two parts provide Facebook users with three different 
layers of protection. The first layer, which is part of the Firefox add-on, enables Facebook users to 
easily control their profile privacy settings by simply choosing the most suitable profile privacy settings 
with just one click. The second layer, which is also part of the Firefox add-on, notifies users
of the number of applications installed on their profile, which may pose a threat to their privacy.
The third layer, a Facebook application, analyzes a user's friends list. By using simple heuristics
(see Section 4.1), the application identifies which friends of a user are suspected of being fake profiles and
therefore pose a threat to the user's privacy. The application presents a convenient method for
restricting the access of fake profiles to a user's personal information without removing them from the
user's friends list.

At the end of June 2012, we launched an initial version of the SPP software as a "free to use 
software" [HI [15] and received massive media coverage with hundreds of online articles and interviews 
in leading blogs and news websites, such as Fox News [3] and NBC News [32]. Due to the media
coverage, in less than four months, 3,017 users from more than twenty countries installed the SPP
Facebook application, 527 of whom used the SPP Facebook application to restrict 9,005 friends.^1
Moreover, at least 1,676 users installed the Firefox add-on, out of which we estimate that 111 users
followed the add-on's recommendation and removed more than 1,792 Facebook applications from their
profiles (see Section 5.1). In addition, the add-on also succeeded in collecting the Facebook privacy
settings of 67 different Facebook users.

To our great surprise, many of the SPP application users used the application not only to restrict
friends that were recommended for restriction, but also to manually search by name for, and restrict,
specific friends whose profiles have a higher likelihood of belonging to real people. The restriction
of real profiles also assists us in studying and constructing classifiers that identify real profiles
recommended for restriction. The collected data obtained from SPP users gave us a unique opportunity to
learn more about user privacy on online social networks in general, and on Facebook in particular.
Also, by using the unique data obtained from users restricting their Facebook friends, as well as by
implementing machine learning techniques, we developed classifiers, which can identify a user's friends
that are recommended for restriction (see Section 4.2). Our classifiers presented an AUC of up to
0.948, precision at 200 of up to 98%, and an average users precision at 10 of up to 24% (see Section 5).
Furthermore, these types of classifiers can also be used by online social network administrators to
identify and remove fake profiles from the online social network.

^1 Due to the unexpectedly massive downloads and usage of the application, our servers did not succeed in supporting
the large number of users at once. Moreover, in our initial version, the SPP Facebook application did not support all
the existing web browsers. Therefore, many users who installed the SPP software were not able to use it
on demand.

In this study we also present statistics on Facebook user privacy settings, which were obtained by 
the SPP Add-on. These statistics demonstrate how exposed Facebook users' personal information is 
to fake profile attacks and third party applications (see Section 5.3). For example, we show that out 
of 1,676 examined Facebook users, 10.68% have more than a hundred Facebook applications installed 
on their profile, and 30.31% of the users have at least 40 Facebook applications installed. Moreover, 
out of the 67 collected users' privacy settings, the majority of the users' personal information is set to
be exposed to friends, leaving this personal information exposed to fake friends.

The remainder of this paper is organized as follows. In Section 2, we provide an overview of
various related solutions, which better help protect the security and privacy of social network users.
In addition, we also present an overview of similar studies, which used machine learning techniques
to predict user properties, such as predicting users' links in social networks. In Section 3, we describe
the SPP software architecture in detail. In Section 4, we describe the methods and experiments used
in this study. We describe the initial deployment of the SPP software and the methods used for the
construction and evaluation of our machine learning classifiers. In Section 5, we present the results of
our study, which include an evaluation of our classifiers and different users' privacy statistics obtained
by the SPP software. In Section 6, we discuss the obtained results. Lastly, in Section 7, we present
our conclusions from this study and offer future research directions.



2 Related Work 

2.1 Online Social Network Security and Privacy 

In recent years, due to the increasing number of privacy and security threats on online social network 
users, social network operators, security companies, and academic researchers have proposed various 
solutions to increase the security and privacy of social network users. 

Social network operators attempt to better protect their users by adding authentication processes 
to ensure that a registered user represents a real live person [22]. Many social network operators,
like Facebook, also offer their users a configurable user privacy setting that enables users to secure
their personal data from other users in the network [27, 29]. Additional protection may include a
shield against hackers, spammers, socialbots, identity cloning, phishing, and many other threats. For
example, Facebook users have an option to report users in the network who harass other users in the
network [11]. In addition, Facebook also developed and deployed an Immune System, which aims to
protect its users from different online threats [38].

Many commercial and open source products, such as Checkpoint's SocialGuard [44], Websense's
Defensio [7], UnitedParents [41], ReclaimPrivacy [36], and the PrivAware application [33], offer online
social network users tools for better protecting themselves. For example, Websense's Defensio
software aims to protect its users from spammers, adult content, and malicious scripts on Facebook. 

In recent years, several published academic studies have proposed solutions for various social network
threats. DeBarr and Wechsler [6] used the graph centrality measure to identify spammers.
Wang [42] presented techniques to classify spammers on Twitter based on content and graph features.
Stringhini et al. [39] presented a solution for detecting spammers in social networks by using
"honey-profiles". Egele et al. [8] presented PoX, an extension for Facebook, which makes all requests for private
data explicit to the user. Yang et al. [43] presented a method to identify fake profiles by analyzing different
features, such as links' creation timestamps and friend request frequency. Anwar and Fong [2]
presented the Reflective Policy Assessment tool, which aids users in examining their profiles from the
viewpoint of another user in the network. Rahman et al. [34] presented the MyPageKeeper Facebook
application, which aims to protect Facebook users from damaging posts on the user's Facebook wall.
In a later study [35], Rahman et al. also presented the FRAppE application for detecting malicious
applications on Facebook. They discovered that 13% of one hundred and eleven thousand Facebook
applications in their dataset were malicious applications. Recently, Fire et al. [16] proposed a method
for detecting fake profiles in online social networks based on anomalies in a fake user's social structure.

In this study, we present the SPP software, which offers methods for improving Facebook user
privacy.^2 By using data collected by the SPP software and machine learning techniques, we present
methods for constructing classifiers that can assist in identifying fake profiles.

2.2 Online Social Networks and Machine Learning 

With the increasing popularity of social networks, many researchers have studied and used a combination
of data obtained from social networks and machine learning techniques to predict different user
properties [25l|Tl[33. Furthermore, several studies used machine learning techniques to improve user 
security in online social networks (391 ESI [16]. 

In 2007, Liben-Nowell and Kleinberg [25] used machine learning techniques to predict links between 
users in different social networks (also referred to as the link prediction problem). In 2010, Stringhini 
et al. [39] proposed a method for detecting spammer profiles by using supervised learning algorithms. 
In the same year, Lee et al. [23] used machine learning and honeypots to uncover spammers in MySpace
and Twitter. Sakaki et al. [37] used machine learning and content data analysis of Twitter users in
order to detect events, such as earthquakes and typhoons, in real-time. In 2012, Altshuler et al. [1] used
machine learning techniques to predict different user properties, such as origin and ethnicity, inside
the "Friends and Family" social network, which was created from logs extracted from users' mobile
devices. Recently, Fire et al. [16] used the online social network's topological features to identify fake
users in different online social networks. 

As part of this study, we present a method for recommending to a Facebook user which of his friends
might be a fake profile and should therefore be restricted. Our method applies supervised learning
techniques to the connection properties between a Facebook user and his friends. This
type of problem is to some degree similar to the problem of predicting link strength, studied by
Kahanda and Neville [21], and the problem of predicting positive and negative links (signed links),
studied by Leskovec et al. [24]. Similarly to the study by Kahanda and Neville, in this study we extract
a different set of meta-content features, such as the number of pictures and videos both the user and
his friends were tagged in. In this study, we also predict the type of a negative relationship between
users, similar to the study of Leskovec et al. However, in our study we aim to uncover fake profiles
rather than identify the link sign or strength between two users. In addition, our study contrasts with other
studies, which used a major part of the social network topology to construct classifiers [211 EH [16] . 
because we construct our classifiers by using only variations of the data collected in real-time from the 
user's point of view rather than data collected from the social network administrator's point of view. 
By using user data, which was obtained in real-time only, we were able to quickly analyze each user's 
friends list with fewer resources and without invading the user's friend's privacy. 

3 Social Privacy Protector Architecture 

To better protect the privacy of Facebook users we developed the Social Privacy Protector software. 
The SPP software consists of three main parts (see Figure 1), which work in synergy: a) Friends
Analyzer Facebook application - which is responsible for identifying a user's friends who may pose a
threat to the user's privacy, b) SPP Firefox Add-on - which analyzes the user's privacy settings and
assists the user in improving privacy settings with just one click, and c) HTTP Server - which is 
responsible for analyzing, storing, and caching software results for each user. In the remainder of this 
section, we describe in detail each individual part of the SPP software. 

3.1 The Friends Analyzer Facebook Application 

The Friends Analyzer Facebook application (also referred to as SPP application) is the part of the SPP
responsible for analyzing a user's friends list to determine which of the user's friends may pose a threat
to the user's privacy. After the user installs the Friends Analyzer application, the application scans
the user's friends list and returns a credibility score for each one of the user's friends. Each friend's
score is created by using simple heuristics or with a more sophisticated machine learning algorithm,
which takes into account the strength of the connection between the user and his friends. The strength
of each connection is based on different connection features, such as the number of common friends
between the user and his friend and the number of pictures and videos the user and his friend were
tagged in together (see Section 4). At the end of the process, the user receives a web page, which
includes a list of all his friends sorted according to each friend's score, where the friends with the lowest
scores, who have the highest likelihood of being fake profiles, appear at the top of the list (see Figure 2).
For each friend in the returned sorted list, the user has the ability to restrict the friend's access to the
user's private information by simply clicking on the restrict button attached to each friend in the sorted
list. Moreover, the application provides the user with an interface to view all his friends alphabetically and
easily restrict access with a single click. This option enables Facebook users to protect their privacy
not only from fake profiles but also from real profiles, such as exes, whom they do not want to have
access to the personal data stored in their Facebook profile.

Figure 1: Social Privacy Protector Architecture.

^2 An initial version of the SPP software was described, as work in progress, in our previous paper [15].



3.2 Social Privacy Protector Firefox Add-on 

The Social Privacy Protector Firefox Add-on (also referred to as Add-on) is the part of the SPP
software responsible for improving user privacy settings with just a few simple clicks. After the Add-on
is installed on the user's Firefox browser, it begins to monitor the user's internet activity. When
the Add-on identifies that the user has logged onto his Facebook account, it analyzes the
number of applications installed on the user's Facebook profile and presents a warning with the number
of installed applications, which may pose a threat to the user's privacy (see Figure 3). The Add-on
also presents the top two results obtained by the Friends Analyzer Facebook application and suggests
which friends to restrict (see Figure 3).

Figure 2: Friends Analyzer Facebook application - user rank and sorted friends list.

The Add-on also detects when the user has entered Facebook's privacy settings page and presents
the user with three new privacy setting options. The new privacy settings are based on the user's
profile type and can be modified with one click (see Figure 4), instead of the more complex Facebook
custom privacy settings that may contain up to 170 options [2]. Using the new Add-on privacy settings,
a user can simply choose the profile type most suitable for him out of three options: a) Celebrity setting
- in this setting all of the user's information is public, b) Recommended setting - in this setting the
user's information is visible only to friends; however, some of the user's details, such as profile name and
pictures, are public, and c) Kids setting - in this setting the profile is only open to the user's friends
and only friends of friends can send friend requests. Using this Add-on, a user can easily control
and improve their privacy without contacting a security expert. Moreover, parents can simply install
this Add-on on their children's Facebook accounts in order to better protect their children's privacy
on Facebook without needing to understand Facebook's different privacy setting options. Furthermore,
it is easy to customize our privacy settings by adding more optional privacy
settings for different types of users. In this study, we utilized user data collected by the Add-on to
further study the privacy settings of Facebook users.
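
The three profile types can be pictured as named presets that override a user's current settings in one step. The following is a minimal sketch under stated assumptions: the setting keys (`profile_details`, `name_and_pictures`, `friend_requests`), the value strings, and the `apply_preset` helper are hypothetical names chosen for illustration, and the Kids values for name and pictures are our reading of "only open to the user's friends". It is not the Add-on's actual code.

```python
# Hypothetical presets paraphrasing the Celebrity, Recommended, and Kids
# profile types described above. Only settings mentioned in the text appear.
PRIVACY_PRESETS = {
    "celebrity": {
        "profile_details": "public",
        "name_and_pictures": "public",
    },
    "recommended": {
        "profile_details": "friends",
        "name_and_pictures": "public",
    },
    "kids": {
        "profile_details": "friends",
        "name_and_pictures": "friends",          # assumption, see lead-in
        "friend_requests": "friends of friends",
    },
}


def apply_preset(current_settings: dict, profile_type: str) -> dict:
    """Return a copy of the current settings overridden by the chosen preset.

    Settings that the preset does not mention are left unchanged.
    Raises KeyError for an unknown profile type.
    """
    updated = dict(current_settings)
    updated.update(PRIVACY_PRESETS[profile_type])
    return updated
```

Keeping the presets in a plain mapping reflects the remark above that further profile types can be added without changing the logic that applies them.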

3.3 HTTP Server 

The HTTP server is the part of the SPP responsible for connecting the SPP Firefox Add-on to the SPP 
Facebook application. When a user installs the SPP software, the server analyzes the user's friends 
list and identifies which of the user's friends may pose a threat to his security. Also, to enhance the
application's performance, the HTTP server caches parts of the analyzed results. In order to protect 
the user's privacy, the application stores only the minimal number of features in an encrypted manner 
using RC4 encryption. 
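
To illustrate what storing features "in an encrypted manner using RC4" could look like, the sketch below implements the standard RC4 stream cipher and applies it to a serialized feature record. The record layout, the key, and the function names are assumptions made for this example; the source does not describe how the server serializes features or manages keys.

```python
import json


def rc4(key: bytes, data: bytes) -> bytes:
    """Apply the RC4 stream cipher; the same call encrypts and decrypts."""
    if not key:
        raise ValueError("key must be non-empty")
    # Key-scheduling algorithm (KSA): permute the state using the key.
    state = list(range(256))
    j = 0
    for i in range(256):
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    # Pseudo-random generation algorithm (PRGA): XOR keystream with data.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)


def encrypt_features(key: bytes, features: dict) -> bytes:
    """Serialize a (hypothetical) feature record to JSON and encrypt it."""
    return rc4(key, json.dumps(features, sort_keys=True).encode("utf-8"))


def decrypt_features(key: bytes, blob: bytes) -> dict:
    """Invert encrypt_features with the same key."""
    return json.loads(rc4(key, blob).decode("utf-8"))
```

Because RC4 is a symmetric stream cipher, applying `rc4` twice with the same key returns the original bytes, which is why one function serves both directions.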

4 Methods and Experiments 

In this study, our experiments are divided into two main parts. In the first part, we deployed an initial
version of the SPP software in order to improve user privacy on Facebook. We also used the initial
version to collect data on each SPP user and his links. The main focus of this part is calculating
the heuristic, which sorts the friends list and recommends which friends to restrict. Additionally, in
this part, we also present methods for analyzing the user privacy setting data collected by the Add-on
to better understand Facebook user privacy settings and understand how much users are exposed to
different security threats, such as information exposure to fake friends.

Figure 3: Social Privacy Protector Firefox Add-on - warning about installed applications and friends
you may want to restrict.

In the second part, we used the data collected in the first part to learn more about Facebook
users' privacy settings. Furthermore, we used the collected data in order to develop machine learning
classifiers, which can identify Facebook profiles with higher likelihoods of being fake. Moreover, these
classifiers can be used to replace the initial SPP heuristic with a more generic model, which can
provide SPP users with recommendations on which friends to restrict. In the second part, our main focus
is on constructing the machine learning classifiers and evaluating their performance.

In the remainder of this section, we present in detail the methods used in each one of the two parts. 

4.1 Deploying Social Privacy Protector - Initial Version 

After we developed the SPP software according to the architecture described in Section 3, we had
to develop a heuristic, which can quickly sort the friends list of each SPP user. In the initial version,
we developed the heuristic to be as simple as possible and based it upon the hypothesis that most
fake users do not have strong connections with real users. To estimate the strength of a connection
between two users, we extracted lists of features and calculated a simple arithmetic heuristic. The
heuristic's main constraint was that for every SPP user, it needed to analyze and evaluate hundreds
or even thousands of connection strengths between the user and each one of his friends in a short
period of time. Moreover, the heuristic needed to take into consideration the performance of the
Facebook application API [10] in the calculation time of each feature. After testing and evaluating
several different features, we decided to extract the following features for each SPP user (referred to as
u) and each one of his friends (referred to as v):

1. Are-Family(u,v) - mark if u and v are defined in Facebook as being in the same family. The
Are-Family feature prevents the cases in which the SPP application will mark a family member
as a fake profile.

Figure 4: Social Privacy Protector Firefox Add-on - optimizing the user's privacy setting with one
simple click.

2. Common-Chat-Messages(u,v) - the number of chat messages sent between u and v. We 
assume that in most cases, such as in a fake profile used to send spam messages, there will be no 
chat interaction between the user and the fake profile. However, when there are different types 
of fake profiles, such as fake profiles used by cyber predators, this feature will be less helpful in 
identifying the threat. 

3. Common-Friends(u,v) - the number of mutual friends both u and v possess. The relevance
of the Common-Friends feature is very intuitive. It is expected that the larger the size of the
common neighborhood, the higher the chances are that the friendship between the users is real.
The Common-Friends feature was previously used to solve different versions of the link prediction
problem [25l EH [231 E] was found to be a very useful feature in many of these scenarios. 

4. Common-Groups-Number(u,v) - the number of Facebook groups both u and v are members
of. It is expected that the higher the number of groups both users are members of, the higher
the chances that u and v have similar fields of interest, which might indicate that the friendship
between u and v is real. The Common-Groups-Number feature was used in the study of Kahanda
and Neville [21] to predict link strength.

5. Common-Posts-Number(u,v) - the number of posts both u and v posted on each other's wall 
in the last year. Similar post features were studied by Kahanda and Neville and were discovered 
to be very useful in predicting strong relationships between two Facebook users [21]. 

6. Tagged-Photos-Number(u,v) - the number of photos both u and v were tagged in together. 
We assume that most fake profiles have almost no shared tagged photos with the user. The 
Tagged-Photos-Number feature was also used in the Kahanda and Neville study [2T]. 

7. Tagged-Videos-Number(u,v) - the number of video clips both u and v appeared in together.
As in the case of tagged photos, we assume that most fake profiles have almost no shared tagged 
video clips with the user. 
8. Friends-Number(u) and Friends-Number(v) - the total number of friends u and v have.
These features are usually referred to as the degree of u and the degree of v, and were extracted
from each user to assist us in improving the supervised learning classifiers, as described in
Section 4.2. However, we did not use these features when calculating the connection strength
heuristics in SPP's initial version.

In an attempt to build better fake profile prediction heuristics, we also tested the following features:
a) the number of mutual Facebook likes both u and v gave to each other, and b) the number of
comments both u and v posted on each other's wall posts. However, although these two features
seemed promising in assisting us in identifying fake profiles, we did not use them in the end due to
performance issues, i.e., calculating these two features was too time-consuming to allow
real-time results.

After several tests and simple evaluations, we decided to use the following simple heuristic in order 
to define the Connection-Strength function between a user u and his friend v:

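$$
\begin{aligned}
\text{Connection-Strength}(u,v) = CS(u,v) := {} & \text{Common-Friends}(u,v) \\
& + \text{Common-Chat-Messages}(u,v) \\
& + 2 \cdot \text{Common-Groups-Number}(u,v) \\
& + 2 \cdot \text{Common-Posts-Number}(u,v) \\
& + 2 \cdot \text{Tagged-Photos-Number}(u,v) \\
& + 2 \cdot \text{Tagged-Videos-Number}(u,v) \\
& + 1000 \cdot \text{Are-Family}(u,v)
\end{aligned}
$$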

In order to tell SPP users which friends ought to be restricted, we ranked each user's friends list
according to the Connection-Strength(u,v) function. Based on Facebook's estimate that 8.7% of all
Facebook accounts do not belong to real profiles [12], we presented to each SPP user the 10% of his
friends who received the lowest Connection-Strength scores (see illustration in Figure 2).
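
The heuristic and the ranking step can be sketched in a few lines of Python. This is an illustrative sketch, not the deployed SPP code: the `ConnectionFeatures` container, its field names, and the choice to round the 10% cutoff up are assumptions made for the example, while the weights follow the Connection-Strength definition given in this section.

```python
import math
from dataclasses import dataclass


@dataclass
class ConnectionFeatures:
    """Hypothetical container for the features of one friend v of user u."""
    friend_id: str
    common_friends: int = 0
    common_chat_messages: int = 0
    common_groups: int = 0
    common_posts: int = 0
    tagged_photos: int = 0
    tagged_videos: int = 0
    are_family: bool = False


def connection_strength(f: ConnectionFeatures) -> int:
    """Weighted sum of the connection features, CS(u, v)."""
    return (
        f.common_friends
        + f.common_chat_messages
        + 2 * f.common_groups
        + 2 * f.common_posts
        + 2 * f.tagged_photos
        + 2 * f.tagged_videos
        + 1000 * int(f.are_family)
    )


def recommend_for_restriction(friends, fraction=0.10):
    """Return the lowest-scoring share of friends, weakest connection first."""
    ranked = sorted(friends, key=connection_strength)
    cutoff = math.ceil(len(ranked) * fraction)  # rounding up is an assumption
    return ranked[:cutoff]
```

The large Are-Family weight pushes family members far down the restriction list regardless of their other features, which matches the stated purpose of that feature.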

To evaluate the performance of the Connection-Strength heuristic, we calculated three statistics
on the heuristic's performance in restricting friends. First, we calculated the heuristic's restricting
precision for different Connection-Strength values. This calculation was performed by measuring, for
different Connection-Strength values, the ratio between the number of friends who were restricted and
the total number of friends who received the exact Connection-Strength value. Second, we calculated
the restriction rates according to the friends' ranking positions in the restriction interface. This
calculation was performed by measuring the percentage of friends who were restricted in each position
in the restriction interface. Lastly, we also calculated the heuristic's average users precision at k for
different k values, in the following manner. First, for each SPP user u who had at least k friends,
we calculated the user's Connection-Strength average precision at k. This was done by selecting the k
friends who received the lowest Connection-Strength values with u out of all of u's friends.^3 We then
calculated the ratio between the number of friends u had restricted among the selected friends and k.
After we finished calculating the Connection-Strength average precision at k for each user u, we then
calculated the heuristic's average users precision at k by simply summing up all the users'
Connection-Strength average precision at k values and dividing the sum by the number of users with at least
k friends. A formal arithmetical definition of the heuristic's average users precision at k is as follows:

$$
\text{CS-Avg-Precision}(k) := \frac{\sum_{\{u \in Users \,\mid\, |friends(u)| \geq k\}} P_u(k)}{\left|\{u \in Users \,\mid\, |friends(u)| \geq k\}\right|}
$$

Where Users is a set which contains all SPP users, friends(u) is a set which contains all of u's
Facebook friends, and P_u(k) is defined to be the heuristic's precision at k for a user u:

$$
P_u(k) := \frac{\sum_{\{f \in friends(u) \,\mid\, \exists f_1,\ldots,f_{n-k} \in friends(u),\ \forall j \in [1,\ldots,n-k]:\ CS(u,f) \leq CS(u,f_j)\}} \text{is-restricted}(u,f)}{k}
$$

Here n = |friends(u)|, and f_1, ..., f_{n-k} are distinct friends of u other than f, so the sum ranges over
the k friends with the lowest Connection-Strength values. The function is-restricted(u, f) returns 1 if
user u had restricted his friend f, or 0 otherwise.

^3 In case ties left more than k friends with the lowest Connection-Strength values, we randomly removed friends with the
highest Connection-Strength value among them until we were left with exactly k friends.
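
As a rough illustration of the definitions above, the following sketch computes P_u(k) for a single user and then averages it over all users with at least k friends. The data layout (a dict of Connection-Strength values per friend and a set of restricted friend ids) and the function names are assumptions made for this example; ties are broken at random, in the spirit of the footnote.

```python
import random


def precision_at_k(strengths, restricted, k, rng=None):
    """P_u(k) for one user.

    strengths  -- dict mapping friend id -> Connection-Strength with u
    restricted -- set of friend ids that u restricted
    """
    if rng is None:
        rng = random.Random()
    friends = list(strengths)
    rng.shuffle(friends)                      # random order among ties
    friends.sort(key=lambda f: strengths[f])  # stable sort keeps that order
    selected = friends[:k]                    # k weakest connections
    return sum(1 for f in selected if f in restricted) / k


def average_users_precision_at_k(users, k, rng=None):
    """CS-Avg-Precision(k): mean of P_u(k) over users with at least k friends.

    users -- iterable of (strengths, restricted) pairs, one per SPP user
    """
    eligible = [(s, r) for s, r in users if len(s) >= k]
    if not eligible:
        raise ValueError("no user has at least k friends")
    total = sum(precision_at_k(s, r, k, rng) for s, r in eligible)
    return total / len(eligible)
```

Users with fewer than k friends are left out of both the sum and the denominator, mirroring the set condition |friends(u)| >= k in the definition.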

The goal of the presented Connection-Strength heuristic was not to identify fake profiles in an
optimal way, but rather to make Facebook users more aware of the existence of fake profiles and the fact
that these types of profiles can threaten their privacy. Additionally, we wanted to collect a unique
dataset that would contain labeled profiles with high likelihoods of being fake profiles.

In addition to collecting meta-content features through the SPP application, the SPP Firefox
Add-on also collected the following user-defined privacy settings each time the user used the Add-on:

1. Installed-Application-Number - the number of installed Facebook applications on the user's
Facebook account.

2. Default-Privacy-Settings - the user's default privacy setting on Facebook, which can take one
of these values: public, friends or custom. This setting determines the scope in which
content created by the user is exposed by default. For example, if the default privacy
setting is "friends", only the user's friends can see his posted content. The
SPP Add-on only stored this value if the user's privacy was set to public or friends. This
privacy setting default value for new Facebook users is set to public.

3. Lookup - regulates who can look up the user's Facebook profile by name. This setting can take 
one of the following values: everyone, friends or friends of friends. This privacy setting default 
value for new Facebook users is set to everyone. 

4. Share-Address - this value is responsible for defining who can see the user's address. This
setting can take one of the following values: everyone, friends or friends of friends. This privacy 
setting default value for new Facebook users is set to everyone. 

5. Send-Messages - this value is responsible for defining who can send messages to the user. This 
setting can take one of the following values: everyone, friends or friends of friends. This privacy 
setting default value for new Facebook users is set to everyone. 

6. Receive-Friend-Requests - this value is responsible for defining who can send friend requests
to the user. This setting is limited to two values only: everyone and friends of friends. The
default value for this setting is everyone.

7. Tag-Suggestions - this value is responsible for defining which Facebook users will receive photo
tag suggestions when photos that look like the user have been uploaded onto Facebook. This 
setting can take one of the following values: no one or friends. This privacy setting default value 
for new Facebook users is set to friends. 

8. View-Birthday - this value is responsible for defining who can view the user's birthday. This 
setting can take one of the following values: everyone, friends or friends of friends. This privacy 
setting default value for new Facebook users is set to friends of friends.

By analyzing and monitoring the privacy settings, we can learn more about the SPP user's privacy 
settings on Facebook. In addition, we can estimate how vulnerable Facebook users' information is 
to fake profile attacks. Furthermore, by analyzing the collected privacy settings, we can also identify 
other potential privacy risks, which are common to many different users. 

4.2 Supervised Learning 

After we deployed the SPP software and gathered enough data on which friends SPP users had
restricted and which friends they had not restricted, our next step was to use supervised learning
techniques to construct fake profile identification classifiers. To construct these
classifiers, we first needed to define the different datasets and their underlying features. Next, we used
different supervised learning techniques to construct the classifiers. Lastly, we evaluated the classifiers
using different evaluation methods and metrics.

In the remainder of this section we describe, in detail, the process of constructing and evaluating
our classifiers. First, in Section 4.2.1, we describe how we defined the different datasets and their
features. Second, in Section 4.2.2, we describe which methods were used to construct our classifiers
and evaluate their performance.

4.2.1 Datasets and features

The SPP application's initial version collected and calculated many different details about each
connection between an SPP user and each one of his friends in real-time (see Section 4.1). Moreover, the
SPP application presented the user with two interfaces for restricting his friends. The first restriction
interface (referred to as the recommendation interface) presents the user with a list of the 10% of his
friends who received the lowest Connection-Strength scores. The second restriction interface (referred
to as the alphabetical interface) presents the user with all of his friends in alphabetical order. Using
these two restriction interfaces, we defined four types of links sets, two unrestricted links sets and two
restricted links sets:

1. All unrestricted links set - this set consists of all the links between the application users and 
their Facebook friends who were not restricted by the application. 

2. Recommended unrestricted links set - this set consists of all the links between the applica- 
tion users and their Facebook friends who were recommended for restriction by the application 
due to a low Connection-Strength score, but who were not restricted by the user. 

3. Recommended restricted links set - this set consists of all the links between the application 
users and their Facebook friends who were recommended for restriction by the application due 
to a low Connection-Strength score and who were restricted by the user. 

4. Alphabetically restricted links set - this set consists of all the links between the application
users and their Facebook friends who were not recommended for restriction by the application,
but whom the user deliberately chose to restrict by using the alphabetical interface.^

Using the links sets defined above, we define the following three datasets:

1. Fake profiles dataset - this dataset contains all the links in the Recommended restricted
links set and in the All unrestricted links set. Namely, this dataset contains all friends who were
restricted due to a relatively low Connection-Strength and all friends who were not restricted.
Therefore, we believe that this dataset is suitable for constructing classifiers that can predict
which friends, who mostly represent fake profiles, the user needs to restrict. We also believe that this
dataset is suitable for replacing the Connection-Strength heuristic with a generic classifier that can
recommend to an SPP user which friends to restrict. In addition, the classifiers constructed from
this type of dataset can assist online network administrators in identifying fake profiles across
the entire network.

2. Friends restriction dataset - this dataset contains all the links in the Alphabetically restricted
links set^ and in the All unrestricted links set. Namely, this dataset contains all the friends who
were not restricted and all the friends who were restricted deliberately by the user, although
they were not recommended by the SPP application. Therefore, we believe that this dataset is
suitable for constructing classifiers that can predict which friends, who mostly represent real profiles,
the user prefers to restrict.

3. All links dataset - this dataset contains all the links in all four disjoint links sets. By
definition, this dataset is the largest among all the defined datasets. We believe that,
like the Fake profiles dataset, this dataset can be suitable for replacing the Connection-Strength
heuristic with a generic classifier that can recommend to an SPP user which friends to restrict.
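
The way the three datasets are assembled from the links sets can be sketched in a few lines of Python. This is a minimal illustration, not the SPP implementation: the `Link` record and its field names are hypothetical. The sketch follows the "Namely" descriptions above, so the Fake profiles dataset combines the links that were recommended and restricted with all unrestricted links. A friend who was shown in the recommendation interface counts as recommended even if the restriction itself was made through the alphabetical interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Link:
    user: str
    friend: str
    recommended: bool  # friend appeared in the recommendation interface
    restricted: bool   # user restricted this friend (via either interface)


def build_datasets(links: List[Link]) -> Dict[str, List[Link]]:
    """Group links into the three datasets used for classifier construction."""
    all_unrestricted = [l for l in links if not l.restricted]
    recommended_restricted = [l for l in links if l.recommended and l.restricted]
    alphabetically_restricted = [l for l in links
                                 if not l.recommended and l.restricted]
    return {
        "fake_profiles": recommended_restricted + all_unrestricted,
        "friends_restriction": alphabetically_restricted + all_unrestricted,
        "all_links": (recommended_restricted + alphabetically_restricted
                      + all_unrestricted),
    }
```

In every dataset the restricted links form the positive class and the unrestricted links the negative class.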



^If a user's friend was presented in the recommendation interface but was restricted by using the
alphabetical interface, the link between the user and the restricted friend was assigned to the
Recommended restricted links set.






For each link in the datasets defined above, the SPP application calculated in real-time all of the first 8
link features defined in Section 4.1, including Friends-Number(v)^ (referred to as the
Friend Friends-Number). In addition, if it was arithmetically possible, we also calculated the following
set of seven features:

1. Chat-Messages-Ratio(u,v) - the ratio between the number of chat messages u and v sent to
each other and the total number of chat messages u sent to all of his friends. The formal
Chat-Messages-Ratio definition is:

$$\text{Chat-Messages-Ratio}(u,v) := \frac{\text{Common-Chat-Messages}(u,v)}{\sum_{f \in friends(u)} \text{Common-Chat-Messages}(u,f)}$$

Where friends(u) is defined to be a set which contains all the friends of u.

2. Common-Groups-Ratio(u,v) - the ratio between the number of Facebook groups u and
v have in common and the maximum number of groups that u has in common with any one of his
friends. The formal Common-Groups-Ratio definition is:

$$\text{Common-Groups-Ratio}(u,v) := \frac{\text{Common-Groups-Number}(u,v)}{\max(\{\text{Common-Groups-Number}(u,f) \mid f \in friends(u)\})}$$



3. Common-Posts-Ratio(u,v) - the ratio between the number of posts u and v posted on
each other's walls and the total number of posts that u posted on all his friends' walls. The
formal Common-Posts-Ratio definition is:

$$\text{Common-Posts-Ratio}(u,v) := \frac{\text{Common-Posts-Number}(u,v)}{\sum_{f \in friends(u)} \text{Common-Posts-Number}(u,f)}$$



4. Common-Photos-Ratio(u,v) - the ratio between the number of photos in which u and v
were tagged together and the total number of photos in which u was tagged. The formal
Common-Photos-Ratio definition is:

$$\text{Common-Photos-Ratio}(u,v) := \frac{\text{Common-Photos-Number}(u,v)}{\sum_{f \in friends(u)} \text{Common-Photos-Number}(u,f)}$$



5. Common-Video-Ratio(u,v) - the ratio between the number of videos in which u and v were
tagged together and the total number of videos in which u was tagged. The formal
Common-Video-Ratio definition is:

$$\text{Common-Video-Ratio}(u,v) := \frac{\text{Common-Video-Number}(u,v)}{\sum_{f \in friends(u)} \text{Common-Video-Number}(u,f)}$$



6. Is-Friend-Profile-Private(v) - in some cases the SPP application did not succeed in collecting
v's friends number (Friends-Number(v)) but succeeded in collecting the Common-Friends(u,v)
value, which returned a value greater than zero. This case may indicate that v's profile is set to
be a private profile. With these cases in mind, we defined the Is-Friend-Profile-Private function
to be a binary function, which returns a true value in case the application did not succeed in
collecting v's friends number but succeeded in collecting a Common-Friends(u,v) value
greater than zero, and a false value otherwise.

7. Jaccard's-Coefficient(u,v) - Jaccard's coefficient is a well-known feature for link prediction
[25] [2T| riT]. The Jaccard's coefficient is defined as the number of common friends u and v
have divided by the number of distinct friends u and v have together. The formal
Jaccard's-Coefficient definition is:

$$\text{Jaccard's-Coefficient}(u,v) := \frac{\text{Common-Friends}(u,v)}{\text{Friends-Number}(u) + \text{Friends-Number}(v) - \text{Common-Friends}(u,v)}$$

A higher value of Jaccard's-coefficient denotes a stronger link between two Facebook users.
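
The ratio features above share two shapes: a count divided by a sum over all of u's friends, and a count divided by a maximum over all of u's friends. The following Python sketch illustrates both, together with Jaccard's coefficient. It is an illustration only: the function names and the dictionary-based input (friend identifier mapped to a count) are assumptions, and returning `None` when the denominator is zero is one possible reading of "if it was arithmetically possible".

```python
from typing import Mapping, Optional


def ratio_to_total(counts: Mapping[str, int], v: str) -> Optional[float]:
    """Count shared with v divided by the sum over all of u's friends.

    Shape of Chat-Messages-Ratio, Common-Posts-Ratio, Common-Photos-Ratio
    and Common-Video-Ratio.
    """
    total = sum(counts.values())
    return counts.get(v, 0) / total if total > 0 else None


def ratio_to_max(counts: Mapping[str, int], v: str) -> Optional[float]:
    """Count shared with v divided by the maximum over all of u's friends.

    Shape of Common-Groups-Ratio.
    """
    top = max(counts.values(), default=0)
    return counts.get(v, 0) / top if top > 0 else None


def jaccard_coefficient(common_friends: int, friends_u: int,
                        friends_v: int) -> Optional[float]:
    """Common friends divided by the number of distinct friends of u and v."""
    distinct = friends_u + friends_v - common_friends
    return common_friends / distinct if distinct > 0 else None
```

For example, a user who exchanged 3 chat messages with v and 1 with another friend gets a Chat-Messages-Ratio of 3/4 for v.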



^In some cases we were not able to extract the friends number of user v, probably due to v's privacy settings.

4.2.2 Classifiers Construction and Evaluation

Using the three datasets and the 15 features defined in the previous sections, we constructed classifiers 
for fake profile identification and for recommending profiles for restriction. 

The process of constructing and evaluating the different classifiers was as follows. First, we matched
the suitable dataset to each type of classification mission in the following manner: a) for identifying
fake profiles we used the Fake profiles dataset; b) for recommending real profiles for restriction we used
the Friends restriction dataset; and c) for recommending to SPP users which real and fake friends
to restrict we used the All links dataset. Next, for each link in each one of the datasets, we extracted the
15 features defined in Sections 4.1 and 4.2.1 and created a features vector for each link. Furthermore, we
added to each link's features vector an additional binary target feature that indicates whether the link
between the SPP user and his friend was restricted by the SPP user. Due to the fact that the majority of
the links in each dataset had not been restricted, these datasets were overwhelmingly imbalanced, with
restricted links as the minority class. Therefore, a naive algorithm that always predicts "not restricted"
will present high prediction accuracy. To overcome the datasets' imbalance problem, we used an
undersampling methodology similar to the one used by Guha et al. to predict trust [19] and by Leskovec
et al. [24] to predict positive and negative links. According to this methodology, we transformed each
imbalanced dataset into a balanced dataset by combining all the restricted links in the dataset with
an equal number of randomly selected unrestricted links from the same dataset. Afterwards, we used the
balanced datasets with the updated links' features vectors to construct several classifiers
by using WEKA [20], a popular suite of machine learning software written in Java and developed at
the University of Waikato, New Zealand. We used WEKA's C4.5 (J48), IBk, NaiveBayes, Bagging,
AdaBoostM1, RotationForest, and RandomForest implementations of the corresponding algorithms.
In addition, we used the simple OneR classifier as a baseline for the performance of the other classifiers.
For each of these algorithms, most of the configurable parameters were set to their default values, with
the following exceptions: for C4.5, the minimum number of instances per leaf parameter was set to
the values 2, 6, 8, and 10; for IBk, the k parameter was set to 5 and 10. The ensemble methods were
configured as follows: the number of iterations for all ensemble methods was set to 100, and the Bagging,
AdaBoostM1, and RotationForest algorithms were evaluated using J48 as the base classifier with the
number of instances per leaf set to 4, 6, 8, and 10. Next, we evaluated our classifiers using the common
10-fold cross-validation approach. We used the area under the curve (AUC), F-measure, true-positive rate,
and false-positive rate to evaluate the different classifiers' performances. Additionally, in order to obtain
an indication of the usefulness of the various features, we also analyzed the features' importance by
using WEKA's information gain attribute selection algorithm.
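
To make the imbalance argument and the undersampling step concrete, the following Python sketch shows both. It is a minimal illustration under assumptions: links are represented as dictionaries with a hypothetical boolean `restricted` key, and the function names are not taken from WEKA or from the SPP software. The baseline function shows why always predicting "not restricted" looks deceptively good on an imbalanced dataset; for instance, with 2,860 restricted links out of 141,146 it is right about 98% of the time.

```python
import random
from typing import List


def majority_baseline_accuracy(restricted: int, total: int) -> float:
    """Accuracy of a classifier that always predicts "not restricted"."""
    return (total - restricted) / total


def undersample(links: List[dict], seed: int = 0) -> List[dict]:
    """Keep every restricted link plus an equal-sized random sample of the
    unrestricted links, so both classes have the same number of instances."""
    minority = [l for l in links if l["restricted"]]
    majority = [l for l in links if not l["restricted"]]
    if len(majority) < len(minority):
        raise ValueError("fewer unrestricted than restricted links")
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid an ordering that reveals the class
    return balanced
```

On a balanced dataset the same trivial baseline drops to 50% accuracy, which makes measures such as AUC and F-measure more informative.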

Furthermore, in order to evaluate the classifiers' recommendations precision at top k (precision@k),
we selected the machine learning algorithm that presented the highest AUC in the above evaluations
and used two evaluation methods to measure the performance of the algorithm on the different datasets.
In the first evaluation method, we split our datasets into training sets and testing sets. For each one of 
the three balanced datasets, we randomly split each dataset into a training dataset, which contained | 
of the labeled instances and a testing dataset, which contained | of the labeled instances. Afterwards, 
we constructed classifiers using the training dataset only. Next, we used the classifiers to classify
the profiles in the testing dataset and sorted the instances according to the classifiers' prediction
probabilities in descending order, so that the links that received the highest probability of being
restricted came first. We then evaluated the classifiers' prediction precision for the top k predictions,
for different values of k.
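
The ranking step of this first evaluation method can be sketched as follows. The Python function below is only an illustration: it assumes that each test instance has already been scored by some classifier and is given as a hypothetical pair of (predicted probability of restriction, whether the link was actually restricted).

```python
from typing import List, Tuple


def ranked_precision_at_k(scored: List[Tuple[float, bool]], k: int) -> float:
    """Fraction of truly restricted links among the k highest-scored links."""
    if not 0 < k <= len(scored):
        raise ValueError("k must be between 1 and the number of instances")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, was_restricted in ranked[:k] if was_restricted) / k
```

A precision at k of 0.9 therefore means that 9 out of every 10 links in the top k were in fact restricted by their users.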

In the second evaluation method, our goal was to measure the classifiers' recommendations average
users precision at k. To achieve this goal we used the following method. First, we selected a user out of
all SPP users. Afterwards, we created a training dataset using all SPP users' links without the selected
user's links. Next, we balanced the training dataset using the same undersampling method described
above. Afterwards, we constructed a classifier using the training dataset and used the selected user's
links as a testing dataset. We used the constructed classifier to predict the probability of restriction
for each link in the selected user's links. We then sorted the classifier's predictions in descending order,
so that the links that received the highest probability of being restricted came first. Subsequently, we
measured the classifier's prediction precision for different k values. Lastly, we repeated this process
for each one of the SPP users and calculated the average of the classifiers' precisions for different k values.
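
The second evaluation method is a leave-one-user-out loop, which the following Python sketch illustrates. Everything named here is hypothetical: `train` stands in for "balance the training links and fit a classifier" and is expected to return a scoring function, and links are dictionaries with `features` and `restricted` keys. Skipping users who have fewer than k links is an assumption of this sketch, borrowed from the way the heuristic's average users precision was defined.

```python
from typing import Callable, Dict, List

Scorer = Callable[[dict], float]


def average_users_precision_at_k(links_by_user: Dict[str, List[dict]],
                                 train: Callable[[List[dict]], Scorer],
                                 k: int) -> float:
    """Leave-one-user-out estimate of the average per-user precision at k."""
    per_user = []
    for held_out, test_links in links_by_user.items():
        if len(test_links) < k:
            continue  # assumption: users with fewer than k links are skipped
        training = [link for user, links in links_by_user.items()
                    if user != held_out for link in links]
        score = train(training)  # never sees the held-out user's links
        ranked = sorted(test_links, key=lambda l: score(l["features"]),
                        reverse=True)
        per_user.append(sum(1 for l in ranked[:k] if l["restricted"]) / k)
    if not per_user:
        raise ValueError("no user has at least k links")
    return sum(per_user) / len(per_user)


def toy_train(training_links: List[dict]) -> Scorer:
    """Toy stand-in for a trained classifier: fewer common friends, higher score."""
    return lambda features: 1.0 / (1.0 + features["common_friends"])
```

Because the classifier is rebuilt for every held-out user, the estimate reflects how the recommendations would look to a user whose own restriction choices were never seen during training.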

Using these evaluation methods, we were able to evaluate how precise our classifiers are in
recommending which friends to restrict, both from the SPP users' point of view and from the online social
network administrator's point of view.



5 Results 



The initial version of the SPP software was formally launched at the end of June 2012 as free-to-use
software [14, 15]. The software launch received massive media coverage, with hundreds of online articles
and interviews in leading blogs and news websites, such as Fox News p] and NBC News [32].

Due to the media coverage, from the 27th of June, 2012 to the 10th of November, 2012, 3,017 users 
from more than twenty countries installed the SPP application out of which 527 users used the SPP 
application to restrict 9,005 friends, with at least one friend restricted for each user. In addition, more 
than 1,676 users had installed the SPP Firefox Add-on and removed at least 1,792 applications. 

In the remainder of this section we present the results obtained from analyzing the collected SPP
software data in the following manner. First, in Section 5.1, we present the datasets obtained by the
SPP application. Afterwards, in Section 5.2, we present the results obtained by our machine learning
classifiers. Lastly, in Section 5.3, we present statistics on the success of our Add-on in assisting Facebook
users in removing unneeded applications from their profiles. Furthermore, in that section, we also
present different statistics about Facebook user privacy settings obtained from the SPP Add-on.



5.1 Collected Datasets 

After the initial software launch the SPP application was installed by 3,017 users, out of which 527
users had restricted 9,005 friends. All friends were restricted between the 27th of June, 2012 and the
10th of November, 2012. To our great surprise, 355 SPP application users used the application not only
to restrict friends that received a low Connection-Strength score, but also to search by name for specific
friends that are probably real profiles and restrict them.

Using the collected users' data we created three datasets as described in Section 4.2.1 (see Table 1).
The first dataset was the Fake-profiles dataset. This dataset contained 141,146 links, out of which 434 SPP
users had restricted 2,860 links (2.03% of all links) that were recommended by the SPP application.
The second dataset was the Friends-restriction dataset. This dataset contained 144,431 links, out of
which 355 users had restricted 6,145 links (4.25% of all links) that were specifically chosen for
restriction by the users using the Alphabetical-interface. The last dataset was the All links dataset,
which contained 151,825 links, out of which 9,005 (6.01% of all links) were restricted. As expected, all
three datasets were overwhelmingly imbalanced, with imbalance rates ranging from 2.03% to 6.01%.



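Table 1: Links Datasets

| Dataset             | Users Number | Restricted Links | Unrestricted Links | Total Links |
|---------------------|--------------|------------------|--------------------|-------------|
| Fake-Profiles       | 434          | 2,860 (2.03%)    | 138,286            | 141,146     |
| Friends Restriction | 355          | 6,145 (4.25%)    | 138,286            | 144,431     |
| All Links           | 527          | 9,005 (6.01%)    | 138,286            | 147,291     |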



To better understand the differences between the restricted links' features and the unrestricted links'
features, we calculated the average values of each extracted feature in each dataset for each link type
(see Table 2). It can be noted that for all the examined features, except for the Friend Friends-Number
and Is-Friend-Profile-Private features, the restricted links received a lower average than the
unrestricted links in each dataset.

To understand how well the Connection-Strength heuristic performed, we calculated, as described
in Section 4.1, the heuristic's restricting precision for different Connection-Strength values (see
Figure 5), the heuristic's restriction rates according to the friends' ranking positions in the Restriction
interface (see Figure 6), and the heuristic's average users precision at k for different values of k (see
Figure 7).
Table 2: Features Average Values for Different Datasets

| Feature                   | Link Type    | Fake Profiles              | Friends Restriction | All Links |
|---------------------------|--------------|----------------------------|---------------------|-----------|
| Are-Family                | Restricted   | 0 links                    | 1 link              | 1 link    |
|                           | Unrestricted | 9 links                    | 9 links             | 9 links   |
| Common-Chat-Messages      | Restricted   | 0.02                       | 6.35                | 4.34      |
|                           | Unrestricted | 30.86                      | 30.86               | 30.86     |
| Common-Friends            | Restricted   | 1.44                       | 19.8                | 13.97     |
|                           | Unrestricted | 36.78                      | 36.78               | 36.78     |
| Common-Groups-Number      | Restricted   | 0.028                      | 0.56                | 0.392     |
|                           | Unrestricted | 0.689                      | 0.689               | 0.689     |
| Common-Posts-Number       | Restricted   | 0.008                      | 0.069               | 0.049     |
|                           | Unrestricted | 0.147                      | 0.147               | 0.147     |
| Tagged-Photos-Number      | Restricted   | 0.004                      | 0.208               | 0.143     |
|                           | Unrestricted | 0.3                        | 0.3                 | 0.3       |
| Tagged-Videos-Number      | Restricted   | 0                          | 0.007               | 0.005     |
|                           | Unrestricted | 0.017                      | 0.017               | 0.017     |
| Friend Friends-Number     | Restricted   | 627.31                     | 819.57              | 756.25    |
|                           | Unrestricted | 703.31                     | 703.31              | 703.31    |
| Chat-Messages-Ratio       | Restricted   | 2.46 · 10^[illegible]      | 0.003               | 0.002     |
|                           | Unrestricted | 0.004                      | 0.004               | 0.004     |
| Common-Groups-Ratio       | Restricted   | 0.006                      | 0.108               | 0.076     |
|                           | Unrestricted | 0.118                      | 0.118               | 0.118     |
| Common-Posts-Ratio        | Restricted   | 0.0003                     | 0.003               | 0.002     |
|                           | Unrestricted | 0.0035                     | 0.0035              | 0.0035    |
| Common-Photos-Ratio       | Restricted   | 2.23 · 10^[illegible]      | 0.003               | 0.002     |
|                           | Unrestricted | 0.0034                     | 0.0034              | 0.0034    |
| Common-Video-Ratio        | Restricted   | 0                          | 0.001               | 0.0007    |
|                           | Unrestricted | 0.0027                     | 0.0027              | 0.0027    |
| Is-Friend-Profile-Private | Restricted   | 5.87%                      | 10.79%              | 9.23%     |
|                           | Unrestricted | 9.81%                      | 9.81%               | 9.81%     |
| Jaccard's-Coefficient     | Restricted   | 0.003                      | 0.034               | 0.024     |
|                           | Unrestricted | 0.045                      | 0.045               | 0.045     |


Although the Connection-Strength heuristic was quite simple, it presented an average users precision
of 33.6% at 1, an average users precision of 27.1% at 10, and an average users precision of 11%
at 100 (see Figure 7). In addition, 31.7% of the friends who appeared in the second position in
the Restriction interface, due to a low Connection-Strength score, were actually restricted by the SPP
users (see Figure 6). Furthermore, 28% of the SPP users' friends who received a Connection-Strength
of 0 were also restricted (see Figure 5). However, the friends' restriction rates sharply declined when
the Connection-Strength score increased. For example, only 10% of the users' friends who received
a Connection-Strength equal to 3 were actually restricted.



5.2 Classifiers' Results 

From the three imbalanced datasets, we created three balanced datasets by using all the restricted
links in each dataset and randomly choosing an equal number of unrestricted links. We then used
the balanced datasets and evaluated the specified machine learning algorithms (see Section 4.2.2) using
a 10-fold cross-validation approach. The evaluation results of the different classifiers are presented
in Figure 8 and in Table 3. It can be seen that on all datasets the Rotation-Forest classification
algorithm presented the best AUC results among all the ensemble classifiers and the J48 decision tree
classification algorithm presented the best results among all the non-ensemble classifiers. In addition,
it can be noted that on all datasets the Rotation-Forest classifier presented considerably better results
than the simple OneR classifier, which we used as a baseline.

Figure 5: Friends restriction precision for different Connection-Strength values - it can be noted that
among all users' friends who received a Connection-Strength of 3, only 10% were actually restricted.

Figure 6: Connection-Strength restriction rates according to friends' ranking positions in the Restriction
interface - it can be noted that among all friends who were ranked in the first position in the
friends Restriction interface, 31.1% were actually restricted by the SPP users.

After we discovered that the Rotation-Forest classifier presented the best overall results, we evaluated
the Rotation-Forest classifier's precision at k for different values of k on the different datasets. We
first calculated the classifier precision for different k values by splitting each dataset into a training 
dataset, which contained | of the links and a testing dataset, which contained ^ of the links. The 
results of this precision at k evaluation are presented in Figure 9. It can be noted that the
Rotation-Forest classifiers presented precision at 200 of 98%, 93%, and 90% for the Friends restriction
dataset, Fake profiles dataset, and All links dataset respectively. In addition, it can be noted that the
classifiers' precision at 500 was 94%, 91%, and 88% for the Fake profiles dataset, Friends restriction
dataset, and All links dataset respectively. Hence, out of the 500 links that were ranked by the
Rotation-Forest classifiers as the links with the highest likelihood of being restricted by the SPP
application users, 470, 455, and 440 links were actually restricted in the Fake profiles dataset, Friends
restriction dataset, and All links dataset respectively.

Figure 7: Connection-Strength average users precision at k - it can be noted that the heuristic's
average users precision at 1 and average users precision at 100 was 33.6% and 11.1% respectively.

Figure 8: Classifiers AUC on the Different Datasets - it can be noted that the Rotation-Forest
classifier received the highest AUC rates on all three datasets.

In order to estimate the classifiers' recommendations precision according to the SPP users' point
of view, we also calculated the Rotation-Forest classifier's average users precision at k, as described in
Section 4.2.2, on the different datasets for different values of k.^
The results of this precision at k evaluation are presented in Figure 10. It can be noticed that the
Rotation-Forest classifiers presented precision at 10 of 24%, 23%, and 14% for the All links dataset,
Fake profiles dataset, and Friends restriction dataset respectively. The Rotation-Forest classifier's
results on the All links dataset indicated that, on average, 2.4 of the ten friends of each user who received
the highest probabilities of being restricted among all the friends of that user were actually
restricted. However, the Rotation-Forest classifier's results on the Friends restriction dataset indicated
that, on average, only 1.4 of the ten friends of each user who received the highest probabilities of being
restricted among all the friends of that user had actually been restricted.

^In the case of the Friends restriction dataset, we calculated the average users precision only for the 355 SPP
application users who were certainly familiar with the alphabetical interface and used it to restrict their friends.

Table 3: Classifiers' Performance on the Different Datasets

| Classifier      | Measure        | Fake Profiles | Friends Restriction | All Links   |
|-----------------|----------------|---------------|---------------------|-------------|
| OneR            | AUC            | [illegible]   | 0.511               | [illegible] |
|                 | F-Measure      | [illegible]   | 0.531               | [illegible] |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | [illegible]   | [illegible]         | [illegible] |
| J48             | AUC            | 0.925         | 0.684               | 0.72        |
|                 | F-Measure      | [illegible]   | [illegible]         | [illegible] |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | [illegible]   | 0.754               | [illegible] |
| IBk (k=10)      | AUC            | 0.833         | 0.587               | 0.545       |
|                 | F-Measure      | 0.744         | 0.49                | [illegible] |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | [illegible]   | [illegible]         | [illegible] |
| Naive-Bayes     | AUC            | 0.902         | 0.645               | 0.663       |
|                 | F-Measure      | [illegible]   | [illegible]         | 0.287       |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | [illegible]   | 0.955               | 0.177       |
| Bagging         | AUC            | 0.946         | 0.73                | 0.75        |
|                 | F-Measure      | 0.89          | [illegible]         | 0.675       |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | [illegible]   | [illegible]         | [illegible] |
| AdaBoostM1      | AUC            | 0.937         | 0.698               | 0.728       |
|                 | F-Measure      | 0.882         | 0.645               | 0.657       |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | [illegible]   | [illegible]         | [illegible] |
| Rotation-Forest | AUC            | 0.948         | 0.79                | 0.778       |
|                 | F-Measure      | 0.897         | 0.719               | 0.696       |
|                 | False-Positive | [illegible]   | [illegible]         | [illegible] |
|                 | True-Positive  | 0.941         | 0.75                | 0.681       |
| Random-Forest   | AUC            | 0.933         | 0.706               | 0.716       |
|                 | F-Measure      | 0.858         | 0.613               | 0.663       |
|                 | False-Positive | 0.14          | 0.278               | 0.369       |
|                 | True-Positive  | 0.857         | 0.565               | 0.679       |

To obtain an indication of the usefulness of the various features, we also calculated the different
features' importance using WEKA's information gain attribute selection algorithm (see Table 4).
According to the information gain selection algorithm, the top two most useful features on all three
datasets were the Common-Friends feature and the Jaccard's-Coefficient feature. Furthermore, according
to the results, there are differences between the features' scores in the different datasets. For
example, the Common-Groups-Ratio feature received a value of 0.113 in the Fake-profiles dataset and
a value of only 0.004 in the Friends-restriction dataset, and the Is-Friend-Profile-Private feature received
a value of 0.056 in the Friends-restriction dataset and a value of only 0.0002 in the All links dataset.
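
For readers unfamiliar with the measure, information gain is the reduction in the entropy of the class label (restricted or not restricted) obtained by knowing a feature's value. The following Python sketch computes it for a feature that is already discrete. It is an illustration of the measure only and is not WEKA's implementation; in particular, how numeric features such as Common-Friends are discretized before scoring is left out, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values: Sequence[Hashable],
                     labels: Sequence[Hashable]) -> float:
    """H(label) - H(label | feature) for a discrete feature."""
    n = len(labels)
    groups: dict = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

On a balanced two-class dataset the score ranges from 0 bits (the feature says nothing about restriction) to 1 bit (the feature determines it completely), so a feature can be far more informative in one dataset than in another.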

5.3 Add-on Results 

The SPP Firefox Add-on was downloaded more than 1,67^ times between the 27th of June, 2012
and the 10th of November, 2012. During that time we succeeded in collecting data on the number
of installed Facebook applications from 1,676 different Facebook users. This data was collected on
21,524 different occasions. Furthermore, we also succeeded in collecting the privacy settings of
at least 67 Facebook users on 129 different occasions.^

^The SPP Add-on was available for download from several locations, such as the Firefox Add-ons website and the 
PrivacyProtector.net website. Due to the fact that not all locations store the number of downloads, we can only estimate 
the number of downloads according to our HTTP Server logs. 

^Due to the fact that not all SPP users opened their Facebook privacy settings during this time period, and probably 
due to problems in parsing the different Facebook privacy settings page layouts, we succeeded in collecting the SPP 
users' privacy settings for only a limited number of users. 
Figure 9: Rotation-Forest Precision@k - it can be seen that the classifiers' precision at 100 was
98%, 91%, and 91% for the Friends restriction dataset, Fake profiles dataset, and All links dataset
respectively.



Table 4: Information Gain Values of Different Features for Different Datasets

(label missing)        ^521   0.466  0.03   0.029  0.005
Friends Restriction    0.036  0.003  0.001
All Links              0.001
Average                0.019  0.017

By analyzing the collected applications data we discovered that the number of Facebook applications
installed on users' profiles, at the time they initially installed our Add-on, ranged from one
installed application to 1,243 installed applications, with an average of 42.266 applications per user.
Moreover, according to the distribution of installed applications, about 34.96% of
the users had fewer than ten applications installed on their profiles. However, 30.31% of the users had
at least 40 installed applications and 10.68% had more than 100 applications installed on their profiles
(see Figure 11).
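
The distribution figures reported here are simple proportions over the per-user application counts. A minimal sketch of how such a summary could be computed, assuming the counts are available as a plain list (the function name and the toy data are illustrative and are not taken from the SPP code):

```python
from statistics import mean


def summarize_app_counts(counts):
    """Summarize a list of per-user installed-application counts."""
    n = len(counts)
    return {
        "min": min(counts),
        "max": max(counts),
        "mean": mean(counts),
        # Same thresholds as in the text: fewer than 10, at least 40, more than 100.
        "share_under_10": sum(c < 10 for c in counts) / n,
        "share_at_least_40": sum(c >= 40 for c in counts) / n,
        "share_over_100": sum(c > 100 for c in counts) / n,
    }


# Hypothetical toy data, not the study's measurements.
toy_counts = [1, 3, 8, 12, 40, 55, 101, 250]
summary = summarize_app_counts(toy_counts)
print(summary)
```

Note that the three shares are not mutually exclusive: every user with more than 100 applications is also counted among those with at least 40.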

In addition to calculating the statistics on the number of installed Facebook applications, we also
tested whether the SPP users had used the Add-on to remove some of their installed Facebook
applications. To identify whether a user had removed Facebook applications using the Add-on, we
checked the user's number of installed applications up to a day after the Add-on was initially installed.
Our Add-on succeeded in collecting this data for 626 users. Out of these 626
users, 111 (17.73%) had removed 1,792 applications, 149 (23.8%) had added 192 applications,
and 366 (58.47%) did not add or remove any applications (see Figure 12). A closer look at the
application removal data reveals that, on average, each of the 111 users removed 34.7% of all
installed applications, and 32 users (28.8%) removed at least 50% of all their installed applications
(see Figure 13).
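
The grouping of users described here can be sketched as a comparison of two application counts per user. The sketch below assumes each user is represented by a pair (count at installation, count a day later); the function names are hypothetical. The percentages are recomputed from the group sizes reported in the text.

```python
def classify_changes(before_after):
    """Count users who removed, added, or kept their applications.

    before_after: list of (count_at_install, count_one_day_later) pairs.
    """
    groups = {"removed": 0, "added": 0, "unchanged": 0}
    for before, after in before_after:
        if after < before:
            groups["removed"] += 1
        elif after > before:
            groups["added"] += 1
        else:
            groups["unchanged"] += 1
    return groups


def percentages(groups):
    """Convert group sizes into percentages of the total, rounded to 2 digits."""
    total = sum(groups.values())
    return {name: round(100 * n / total, 2) for name, n in groups.items()}


# Hypothetical toy pairs to show the classification step.
toy_groups = classify_changes([(10, 4), (5, 7), (3, 3)])

# Group sizes reported in the text (626 users observed a day after installation).
reported = {"removed": 111, "added": 149, "unchanged": 366}
shares = percentages(reported)
print(toy_groups, shares)
```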

If we look at the overall time period of our experiments, from the 27th of June, 2012 to the 10th
of November, 2012, we can see that out of 1,676 users, 335 (19.99%) decreased the number of







Figure 10: Rotation-Forest average users' precision@k - it can be seen that the classifiers' average
users precisions at 20 were 21%, 20%, and 14% for the All links dataset, Fake profiles dataset, and
Friends restriction dataset, respectively.
Figure 11: Distribution of the Number of Installed Facebook Applications - it can be noted that
10.68% of the users have more than 100 applications installed on their profiles.

installed applications on their profiles. These users had removed 5,537 applications with an average 
of 16.52 application removals per user and a median of seven. 

In addition to checking how many applications were removed, we also checked how many new
applications were installed by each of the Add-on users. To achieve this goal, we first focused on the
group of Add-on users who, to the best of our knowledge, had more applications installed on their
profile at the end of the 10th of November, 2012 than on the day they first installed the Add-on. For
this group of users, we calculated how many applications were added to their profiles on average
each week. We discovered that out of 1,676 users, 389 (23.2%) increased the number of installed
applications on their profile, with between 0.05 and 107.33 new application installations per
week on average (with a median of 0.636 and an average of 1.91).
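
A per-user weekly installation rate of this kind can be sketched as the change in the application count divided by the length of the observation period in weeks. This is an assumed formulation; the exact computation used in the study is not given in this section, and the function name and toy values are hypothetical.

```python
from datetime import date


def weekly_install_rate(first_count, last_count, first_day, last_day):
    """Average number of applications added per week between two observations."""
    weeks = (last_day - first_day).days / 7
    if weeks <= 0:
        raise ValueError("observations must span a positive time period")
    return (last_count - first_count) / weeks


# Hypothetical user: 10 applications on 1 July, 14 applications two weeks later.
rate = weekly_install_rate(10, 14, date(2012, 7, 1), date(2012, 7, 15))
print(rate)
```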

We also analyzed the distribution of the privacy settings collected from the 67 unique Add-on users
(see Table 5). It can be noticed that 74.62% of the users set their default privacy settings to be exposed
to everyone. Moreover, according to the users' privacy settings, almost all the
user information, except Tag-Suggestions, is exposed to the friends of the user. In addition, we also
tested how many Add-on users changed their privacy settings during this time period, and discovered
that according to our logs 14 (20.9%) Add-on users changed their privacy settings. However, after a
Figure 12: Distribution of the Difference in the Number of Installed Applications a Day After the
Add-on Installation.

short while the majority of these fourteen users returned to their old, less restricted privacy settings. 



Table 5: Add-on Unique Users Privacy Settings Distribution

(column header fragment)   Send
(label missing)            74.62%  74.62%  74.63%  11.94%
(label missing)            23.88%  25.38%  100%    25.27%  52.87%
Friends                    86.76%  88.06%
of Friends
No one                     23.88%






6 Discussion 

By analyzing the results presented in Section 5, we can notice the following:

First, we notice that the initial SPP application presented relatively good performance.
Although we defined the Connection-Strength heuristic to be quite simple, it presented remarkable
precision: on average, 31.1% of the users' friends who were ranked in first place were actually
restricted (see Figure 6). Furthermore, the heuristic also presented an average users precision of 33.6%
at 1, an average users precision of 27% at 10, and an average users precision of 11% at 100 (see
Figure 7). However, the Connection-Strength heuristic was not able to provide a generic method, with
high true-positive rates and low false-positive rates, for recommending the restriction of links. For
example, only 10% of the SPP users' friends who received a Connection-Strength value of 3
were restricted (see Figure 8).
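
The two ranking measures used in this discussion can be sketched as follows. The sketch assumes that precision at k is the fraction of restricted friends among the k highest-ranked friends, and that average users precision at k is the mean of that value over all users; these definitions and the function names are assumptions made for illustration, and users with fewer than k friends are scored over the friends they have.

```python
def precision_at_k(ranked_labels, k):
    """Fraction of the first k ranked links that were actually restricted.

    ranked_labels: booleans ordered by descending recommendation score
    (True means the user restricted that friend).
    """
    top = ranked_labels[:k]
    if not top:
        return 0.0
    return sum(top) / len(top)


def average_users_precision_at_k(per_user_rankings, k):
    """Mean of precision_at_k over all users (assumed definition)."""
    values = [precision_at_k(labels, k) for labels in per_user_rankings]
    return sum(values) / len(values)


# Hypothetical toy rankings for two users.
rankings = [
    [True, False, False],        # user A: top-ranked friend was restricted
    [False, True, True, False],  # user B: top-ranked friend was not restricted
]
avg_p1 = average_users_precision_at_k(rankings, 1)
avg_p2 = average_users_precision_at_k(rankings, 2)
print(avg_p1, avg_p2)
```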

Second, among all tested machine learning algorithms, the Rotation-Forest classifiers performed
best on all datasets, with especially good results on the Fake profiles dataset: an AUC of 0.948 and
a false-positive rate of 15.8% (see Table 3).

Third, according to the results, the Rotation-Forest classifiers' average users precision at 1 on the
All links dataset and on the Fake profiles dataset was 21% and 20%, respectively. These results
were worse than the results presented by the Connection-Strength heuristic, which presented an
average users precision at 1 of 34%. However, the classifiers' average users precision at k for higher
k values was nearly the same as the Connection-Strength heuristic's precision at k. For example, the
Figure 13: Removed Application Percentage - every slice of the pie represents the percentage of
applications removed by a percentage of the users. For example, 9% of the Add-on users removed
between 90% and 100% of all applications installed on their profiles. It can be noted that about 29% of
all the users removed more than 50% of all applications installed on their profiles.
(Legend, Application Removal Rate: 90%-100%, 70%-90%, 50%-70%, 30%-50%, 20%-30%, 10%-20%, 0%-10%.)



Connection-Strength heuristic's average users precision at 20 was 22%, while the Rotation-Forest
classifiers' average users precision at 20 was 21% on the All links dataset and 20% on the Fake profiles
dataset (see Figures 7 and 10). Nevertheless, using Rotation-Forest classifiers has several advantages
that the Connection-Strength heuristic does not have, such as providing a generic model for link
restriction recommendation, and providing the restriction probability for each link in the network
without the need to compare each link to other links of the same user, as is done in the case of the
Connection-Strength heuristic.

Fourth, in contrast to the other classifiers, the classifier that was constructed from the Friends
restriction dataset using the Rotation-Forest algorithm presented a relatively low average users
precision at 1 of 14% (see Figure 10). However, this precision is significantly better than the precision
obtained by random guessing, which stands at 4.25% in the Friends restriction dataset. We assume that
this classifier presented relatively low performance because the restricted friends in this dataset were
mainly real friends, whom the SPP users chose to restrict for different reasons. We assume that these
reasons cannot be inferred from the features we extracted in the SPP's initial version.

Fifth, when the Rotation-Forest classifiers were evaluated on the general scenario of predicting which
links to restrict among all users' links, the classifiers presented very high precision rates. For example,
the Rotation-Forest classifier constructed from the Fake profiles dataset links presented 91%
precision at 100 and 94% precision at 500 (see Figure 9). Moreover, the Rotation-Forest classifier
constructed from the Friends restriction dataset presented an impressive precision at 100 of
98%. These results indicate that the classifiers can be used not only by the social network users, but
also by the online social network administrator in order to identify fake profiles among all profiles on
the network.

Sixth, according to the information gain results we can conclude that on all datasets the most
useful features were the Common-friends feature and the Jaccard's-Coefficient feature (see Table 4).
Additionally, the Is-Friend-Profile-Private feature was found to be very useful in the case of the Friends
restriction dataset, indicating that friends who have their profile set to private have a higher
likelihood of being restricted. Moreover, according to these results it is noticeable that the Is-Family
and the Tagged-Video-Number features were not so useful. Removing the extraction of these features
in future versions can assist in improving the SPP application run time without significantly affecting
the results.
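
For readers unfamiliar with the measure, information gain can be sketched as the entropy of the class labels minus the expected entropy that remains after splitting the data on a feature. The sketch below handles discrete feature values only; continuous features, such as a Jaccard coefficient, would first need to be discretized, and how the study's tooling did this is not stated in this section. The function names and toy data are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the expected entropy after splitting on the feature."""
    total = len(labels)
    by_value = {}
    for value, label in zip(feature_values, labels):
        by_value.setdefault(value, []).append(label)
    remainder = sum(
        len(subset) / total * entropy(subset) for subset in by_value.values()
    )
    return entropy(labels) - remainder


# Hypothetical toy data: 1 = restricted link, 0 = unrestricted link.
labels = [1, 1, 0, 0]
gain_informative = information_gain(["private", "private", "public", "public"], labels)
gain_uninformative = information_gain(["private", "public", "private", "public"], labels)
print(gain_informative, gain_uninformative)
```

A feature that separates the two classes perfectly receives the full label entropy as its gain, while a feature that is independent of the label receives a gain of zero, which is why near-zero values in Table 4 mark candidates for removal.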

Seventh, according to the applications statistics it can be noted that many Facebook users
installed many applications on their Facebook accounts, which can jeopardize their privacy. According
to our results, out of 1,676 examined Facebook users, 30.31% had at least forty installed
applications and 10.68% had more than a hundred installed applications (see Figure 11).
Moreover, according to our results, out of the 111 users who used the SPP Add-on for application
removal, 28.8% removed at least 50% of all their installed applications within a day after they installed
the SPP Add-on (see Figure 13). These results indicate that in many cases the installed applications are
unwanted or unneeded. Furthermore, our results also uncovered an alarming phenomenon:
many Facebook users install new applications weekly. According to our results, 389 out of 1,676
users increased the number of installed Facebook applications on their profiles, with an average
of 1.91 new application installations per week.

Eighth, according to the collected users' privacy statistics we can see that almost all of the examined
users' information is available to friends, leaving the users' information exposed to fake friends (see
Table 5). In addition, the vast majority of examined users also set their "Default Privacy Setting" to be
accessible to everyone. This result indicates that many users do not protect their personal information
and leave it exposed to public view.

Lastly, according to the overall results we can see that the SPP software has assisted its users in
protecting their privacy both by restricting friends and by removing unwanted Facebook applications.
However, in most cases the SPP software did not succeed in assisting users to improve their privacy
settings.



7 Conclusions 

In this study, we presented the SPP software, which aims to better protect users' privacy on Facebook.
We presented in detail the general architecture of the SPP software (see Section 3). According to this
architecture, the SPP software can be divided into three layers of protection. The first layer helps to
restrict a friend's access to the user's personal information. The second layer helps to identify and warn
the user about installed Facebook applications that can violate the user's privacy. The third layer
helps the user to adjust their privacy settings with one click. According to this software architecture,
the heart of the SPP software lies in the function that is responsible for recommending to each
user which friends to restrict. This function can be a simple heuristic or a more complicated machine
learning classifier.

In the initial version of the SPP software we chose to implement the Connection-Strength heuristic,
which was responsible for recommending to each user which of his friends to restrict (see Section 4.1).
By using the software interfaces and the Connection-Strength heuristic recommendations, 527 out of
3,017 software users restricted 9,005 friends in less than four months. According to these results,
we can conclude that the Connection-Strength heuristic presented
the users with relatively good recommendations. By using these recommendations, the SPP
users restricted 30.87% of their friends who appeared in the first position in the application's
Restriction interface (see Figure 6). However, the Connection-Strength heuristic did not provide a
general method of identifying which of the users' links need to be restricted.

To create general link restriction recommendation methods we chose to use a supervised learning
approach. By using the unique data created by the initial SPP version, we created three
types of datasets for different friend restriction scenarios (see Section [i]). We used these datasets to
construct and compare different machine learning algorithms in order to identify which algorithm can
provide the best results (see Section 4.2.2). We discovered that the Rotation-Forest algorithm presented
the best AUC and false-positive rate results on all three datasets (see Table 3). We then showed that the
Rotation-Forest classifiers created from the Fake profiles dataset and the All links dataset
presented good average users precision at k results (see Figure 10). Furthermore, we demonstrated
that these classifiers can provide Facebook administrators with a method that can assist them in
identifying fake profiles among all users in the network (see Figure 9). However, according to our
evaluations these classifiers suffer from relatively high false-positive rates. We believe that these
false-positive rates can be considerably reduced if the network administrator uses the classifiers to
evaluate several links, instead of one link only, for each suspicious profile that the classifiers had marked
as being fake. We hope to verify this assumption in future research.
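
The expected effect of evaluating several links per profile can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that the links of a genuine profile are misclassified independently of each other with the same per-link false-positive rate, and that a profile is flagged only when at least `min_votes` of its `n_links` evaluated links are classified as fake. Neither assumption is verified in this study, and the function name is hypothetical.

```python
from math import comb


def false_alarm_probability(fp_rate, n_links, min_votes):
    """Probability that at least min_votes of n_links are false positives,
    assuming independent errors with the same per-link rate fp_rate."""
    return sum(
        comb(n_links, i) * fp_rate**i * (1 - fp_rate) ** (n_links - i)
        for i in range(min_votes, n_links + 1)
    )


# Per-link false-positive rate of 15.8%, as reported for the Rotation-Forest
# classifier on the Fake profiles dataset.
single_link = false_alarm_probability(0.158, n_links=1, min_votes=1)
majority_of_five = false_alarm_probability(0.158, n_links=5, min_votes=3)
print(single_link, majority_of_five)
```

Under these assumptions, requiring three of five links to agree lowers the chance of wrongly flagging a genuine profile from 15.8% to roughly 3%. Real links of the same profile are likely to be correlated, so the actual reduction may be smaller.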

In this study, we also collected statistics from 1,676 different Facebook users on the number of
applications installed on their Facebook profiles during different time periods. According to these
statistics we discovered that many Facebook users have an alarming number of applications installed
on their profiles: 30.31% of the users had at least forty installed applications (see Figure 11). In
addition, our statistical analysis also showed that many users continued to install new applications, with
an average of 1.91 new applications every week. Fortunately, according to our results, we discovered
that at least 111 SPP Add-on users had used the SPP Add-on to improve their privacy and removed at
least 1,792 applications (see Figure 12). These results indicate that by making users more aware of the
existence of installed applications we can assist in reducing the number of installed applications,
and may decrease the exposure of users' personal information to third-party companies. Furthermore,
in this study we also collected statistics on the privacy settings of 67 unique Facebook users. According
to these privacy statistics we can conclude that a majority of the users expose their private information
to friends, and in many cases even to the public. Once again, these statistics sharply demonstrate
how exposed Facebook users' information can be to both fake profile attacks and third-party Facebook
applications.

In the future, we hope to continue our study and provide an updated version of the SPP Add-on,
which will support more web browsers, such as Chrome and Internet Explorer. In addition,
we plan to remove the extraction of less useful features, like the Tagged-Video-Number feature, and
thereby improve the SPP application's performance. We hope that these improvements will help
SPP users restrict more fake profiles and thereby increase the size of our classifiers' training
set.

We also believe that this study has several future research directions, which can improve the
identification of fake profiles in online social networks. A possible direction is to extract more complicated
topological features, such as the number of communities of each user, and use them to construct better
classifiers with lower false-positive rates. In our previous study [16], we demonstrated that these types
of features can assist in identifying fake profiles. Another possible direction, which can be used to
improve the classifiers' performance, is to construct the classifiers by using oversampling techniques,
like SMOTE [5], to deal with the dataset imbalance issue instead of the undersampling techniques
we used in this study. We also hope to test the constructed classifiers' performance on different
online social networks, such as Google+ and Twitter. Another future direction we want to examine is
the usage of the SPP software as an educational tool. We also hope to examine whether users who utilize
the SPP software to restrict friends and remove applications become more aware of their privacy, and as
a result tend to accept fewer friend requests and install fewer applications. In future studies, we also
hope to perform a deeper analysis of the unique datasets we obtained with the SPP software and
extract different insights on connections between Facebook users. We also hope to test the developed
algorithms to improve users' security and privacy in other online social networks.
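
To illustrate the oversampling direction mentioned above, the following is a simplified sketch of SMOTE-style interpolation: each synthetic minority-class sample is placed at a random point on the segment between an existing minority sample and one of its nearest minority-class neighbours. It is not the reference SMOTE implementation and was not used in this study; the function name, parameters, and toy data are hypothetical. In practice an established implementation, such as the SMOTE class in the imbalanced-learn package, would normally be preferred.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbours.

    minority: list of equal-length numeric tuples (at least two samples).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class (for example, restricted links).
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_samples = smote_like_oversample(minority_samples, n_new=5)
print(new_samples)
```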



8 Availability 

The Social Privacy Protector and parts of its source code are available for download from
http://www.socialprotector.net. The Friend Analyzer Facebook application is available to download
from https://apps.facebook.com/friend_analyzer_app. A video with detailed explanations on
how to use the SPP application is available at http://www.youtube.com/watch?v=Uf0LQsP4sSs